18 research outputs found

    A methodology for efficient code optimizations and memory management

    The key to optimizing software is the correct choice, ordering, and parameterization of optimizing transformations, a problem that has remained open in compilation research for decades, for several reasons. First, most of the compilation subproblems/transformations are interdependent, so addressing them separately is not effective. Second, it is very hard to couple the transformation parameters to the processor architecture (e.g., cache size and associativity) and to the algorithm characteristics (e.g., data reuse); therefore, compiler designers and researchers either do not take them into account at all or do so only partly. Third, the search space (all combinations of transformation parameters) is very large, so exhaustive searching is impractical. In this paper, the above problems are addressed for data-dominant affine loop kernels, delivering significant contributions. A novel methodology is presented that takes as input the underlying architecture details and the algorithm characteristics, and outputs near-optimum parameters for six code optimizations targeting L1 accesses, L2 accesses, DDR accesses, execution time, or energy consumption. The proposed methodology has been evaluated on both embedded and general-purpose processors, on six well-known algorithms, achieving high speedup and energy-gain values over the gcc compiler, hand-written optimized code, and Polly.
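    The coupling the abstract describes can be made concrete with loop tiling, one representative transformation whose parameter depends directly on cache geometry. The sketch below is our minimal illustration, not the paper's algorithm; the L1 size, associativity, and line size are assumed values.

```c
/* Illustrative sketch: deriving a loop-tiling factor from cache
 * geometry, the kind of architecture-coupled parameter selection the
 * methodology automates. All cache parameters here are assumptions. */
#include <stdio.h>

#define L1_SIZE  32768  /* bytes (assumed)  */
#define L1_ASSOC 8      /* ways (assumed)   */
#define LINE     64     /* bytes per line   */

int main(void) {
    /* For a kernel streaming two float arrays, size the tile so both
       tiles fit in L1 while leaving one way free to absorb conflicts. */
    int usable = L1_SIZE * (L1_ASSOC - 1) / L1_ASSOC;
    int tile = usable / (2 * (int)sizeof(float));
    tile -= tile % (LINE / (int)sizeof(float));  /* whole cache lines */
    printf("tile = %d elements\n", tile);
    return 0;
}
```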

    Combining software cache partitioning and loop tiling for effective shared cache management

    One of the biggest challenges in multicore platforms is shared cache management, especially for data-dominant applications. Two commonly used approaches for increasing shared cache utilization are cache partitioning and loop tiling. However, state-of-the-art compilers lack efficient cache partitioning and loop tiling methods, for two reasons. First, cache partitioning and loop tiling are strongly coupled, so addressing them separately is simply not effective. Second, cache partitioning and loop tiling must be tailored to the details of the target shared cache architecture and the memory characteristics of the co-running workloads. To the best of our knowledge, this is the first time that a methodology provides i) a theoretical foundation for the above-mentioned cache management mechanisms and ii) a unified framework to orchestrate these two mechanisms in tandem (not separately). Our approach lowers the number of main memory accesses by an order of magnitude while keeping the number of arithmetic/addressing instructions at a minimal level. We motivate this work by showcasing that cache partitioning, loop tiling, data array layouts, shared cache architecture details (i.e., cache size and associativity), and the memory reuse patterns of the executing tasks must be addressed together as one problem when a (near-)optimal solution is requested. To this end, we present a search space exploration analysis in which our proposal offers a vast reduction of the required search space.
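    To see the coupling in code: a tile size that is valid for the whole shared cache overflows the partition actually granted to the task. The sketch below is ours, not the paper's framework; the partition size, matrix size, and kernel are assumptions.

```c
/* Illustrative sketch: loop tiling sized against an assumed shared
 * cache partition rather than the full cache. Not the paper's code. */
#include <string.h>

#define N 512
#define PARTITION_BYTES (256 * 1024)  /* assumed per-task partition */

/* Pick T so roughly three T x T double tiles fit the partition:
   3 * 64 * 64 * 8 bytes = 96 KiB < 256 KiB. */
enum { T = 64 };

void mm_tiled(const double A[N][N], const double B[N][N], double C[N][N]) {
    memset(C, 0, sizeof(double) * N * N);
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < ii + T; i++)
                    for (int k = kk; k < kk + T; k++)
                        for (int j = jj; j < jj + T; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```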

    Cache partitioning + loop tiling: A methodology for effective shared cache management

    In this paper, we present a new methodology that provides i) a theoretical analysis of the two most commonly used approaches for effective shared cache management (i.e., cache partitioning and loop tiling) and ii) a unified framework for fine-tuning these two mechanisms in tandem (not separately). Our approach lowers the number of main memory accesses by one order of magnitude while keeping the number of arithmetic/addressing instructions at a minimal level. We also present a search space exploration analysis in which our proposal offers a vast reduction of the required search space.
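    The dependence between the two mechanisms can be summarized by a capacity constraint of the following form (our notation, not taken from the paper):

```latex
% Illustrative constraint (our notation): a task granted W of the A ways
% of a shared cache of size C must pick tile sizes T_1..T_m whose
% combined working set fits its partition.
\[
\sum_{k=1}^{m} T_k \, s_k \;\le\; \frac{W}{A}\, C, \qquad 1 \le W \le A,
\]
% where s_k is the element size of array k. Shrinking the partition
% (smaller W) forces smaller tiles and more loop overhead, so the two
% parameters must be chosen jointly rather than one after the other.
```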

    A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD

    In this paper, a new methodology for speeding up Matrix–Matrix Multiplication using the Single Instruction Multiple Data (SIMD) unit, on one or more cores sharing a cache, is presented. This methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedups from 1.08 up to 3.5) by decreasing the number of instructions (load/store and arithmetic) and the number of data cache accesses and misses in the memory hierarchy. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and hardware parameters (e.g., data cache sizes and associativities) as one problem and not separately, giving high-quality solutions and a smaller search space.
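    As a flavor of the kind of kernel involved, here is a minimal SSE micro-kernel of our own (a sketch under assumed row-major layout and n divisible by 4, not the paper's generated code):

```c
/* Illustrative SSE micro-kernel: broadcasts one element of A across a
 * SIMD register and multiplies it against four consecutive elements of
 * B, cutting load and arithmetic instruction counts. Sketch only. */
#include <xmmintrin.h>

void mmm_simd(int n, const float *A, const float *B, float *C) {
    /* row-major n x n matrices; n assumed to be a multiple of 4 */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j += 4) {
            __m128 c = _mm_setzero_ps();
            for (int k = 0; k < n; k++) {
                __m128 a = _mm_set1_ps(A[i * n + k]);   /* broadcast */
                __m128 b = _mm_loadu_ps(&B[k * n + j]); /* 4 floats  */
                c = _mm_add_ps(c, _mm_mul_ps(a, b));
            }
            _mm_storeu_ps(&C[i * n + j], c);
        }
}
```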

    Efficient FPGA implementations of Volterra DFEs for optical systems

    In this work, suitable architectures and high-throughput FPGA implementations of Volterra Decision Feedback Equalizers (VDFEs) for optical communication links are presented. Two VDFE configurations were selected based on the available resources of the employed FPGA devices, and two multiplexer-based architectures were developed for each of them in order to achieve the target throughput. The comparison of the experimental results across different VDFE configurations, throughputs, and FPGA devices points out the platform-specific design characteristics. The introduced architectures meet the desired 10 Gb/s throughput, demonstrating that the FPGA is a suitable platform for high-speed optical fiber communication systems.
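    For readers unfamiliar with the equalizer itself, one VDFE step is sketched below as a behavioral model (our illustration; tap counts and coefficient names are assumptions, and the paper's contribution is the pipelined, multiplexer-based hardware mapping of this recurrence, not the recurrence itself):

```c
/* Behavioral sketch of one second-order Volterra DFE step: linear
 * feedforward terms, second-order Volterra cross-terms, decision
 * feedback, and a binary slicer. Illustrative only. */
#define FF 3  /* feedforward taps (assumed) */
#define FB 2  /* feedback taps (assumed)    */

static double h1[FF];      /* linear kernel         */
static double h2[FF][FF];  /* second-order kernel   */
static double b[FB];       /* feedback coefficients */

/* x: last FF received samples (x[0] newest); d: last FB decisions. */
double vdfe_step(const double *x, double *d) {
    double y = 0.0;
    for (int i = 0; i < FF; i++) {
        y += h1[i] * x[i];
        for (int j = i; j < FF; j++)      /* symmetric kernel */
            y += h2[i][j] * x[i] * x[j];
    }
    for (int k = 0; k < FB; k++)
        y -= b[k] * d[k];                 /* decision feedback */
    double dec = (y >= 0.0) ? 1.0 : -1.0; /* slicer */
    for (int k = FB - 1; k > 0; k--)      /* shift decision history */
        d[k] = d[k - 1];
    d[0] = dec;
    return dec;
}
```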

    Array size computation under uniform overlapping and irregular accesses

    The size required to store an array is crucial for an embedded system, as it affects the memory size, the energy per memory access, and the overall system cost. Existing techniques for finding the minimum number of resources required to store an array are less efficient for codes with large loops and irregularly occurring memory accesses. They either approximate the accessed parts of the array, leading to overestimation of the required resources, or their exploration time grows with the number of distinct accessed parts of the array. We propose a methodology to compute the minimum resources required for storing an array that keeps the exploration time low and provides a near-optimal result for both regularly and irregularly occurring memory accesses and for overlapping writes and reads.
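    A toy instance makes the problem concrete. For a kernel in which element A[i] is written at iteration i and last read at iteration i + 3, at most four elements are ever live at once, so a four-slot buffer suffices. The sketch below (ours, not the paper's method) finds that bound by scanning the liveness windows:

```c
/* Illustrative minimum-buffer computation for a toy access pattern:
 * A[i] is written at iteration i and last read at iteration i + DIST,
 * so the answer is the peak number of simultaneously live elements. */
#include <stdio.h>

#define N    100
#define DIST 3   /* assumed reuse distance of the toy kernel */

int main(void) {
    int max_live = 0;
    for (int t = 0; t < N + DIST; t++) {
        int lo = (t - DIST > 0) ? t - DIST : 0;  /* oldest live write */
        int hi = (t < N - 1) ? t : N - 1;        /* newest write yet  */
        int live = hi - lo + 1;
        if (live > max_live) max_live = live;
    }
    printf("minimum buffer: %d elements\n", max_live);  /* prints 4 */
    return 0;
}
```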

    FPGA implementation of a MIMO DFE in 40 Gb/s DQPSK optical links

    In this paper, an FPGA implementation of a Multiple Input Multiple Output (MIMO) Decision Feedback Equalizer (DFE) is proposed for the electronic compensation of the impairments in 40 Gb/s Intensity Modulated Direct Detection (IM/DD) optical communication links employing NRZ DQPSK signaling. The proposed equalizer is used for the electronic compensation of the residual Chromatic Dispersion (CD) along installed, optically compensated optical paths. The required processing rate is achieved by applying intensive pipelining and parallelism to the original architecture of the equalizer. At the given processing rate, an 8-input, 2-output DFE with three-tap feedforward filtering and two-tap feedback filtering is implemented on a single cutting-edge Xilinx FPGA device.
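    The feedback loop is what makes a DFE hard to run at 40 Gb/s: each decision depends on the previous ones, so the loop cannot be pipelined directly. A standard unrolling trick, of the kind implied by "intensive pipelining and parallelism," precomputes the output for every possible previous decision and selects the right one with a multiplexer; a one-feedback-tap binary sketch of our own:

```c
/* Speculative (look-ahead) DFE step, one feedback tap, binary
 * decisions: both candidate outputs are computed unconditionally (in
 * hardware, in parallel) and the previous decision drives a mux.
 * Illustrative sketch, not the paper's architecture. */
int dfe_step_unrolled(double ff_out, double b1, int prev_dec) {
    double y_if_pos = ff_out - b1 * (+1.0); /* assume prev was +1 */
    double y_if_neg = ff_out - b1 * (-1.0); /* assume prev was -1 */
    double y = (prev_dec > 0) ? y_if_pos : y_if_neg; /* multiplexer */
    return (y >= 0.0) ? +1 : -1;            /* slicer */
}
```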

    A template-based methodology for efficient microprocessor and FPGA accelerator co-design

    Embedded applications usually require Software/Hardware (SW/HW) designs to meet hard timing constraints and the required design flexibility. Exhaustive exploration of SW/HW designs is a very time-consuming task, while ad hoc approaches and the use of partially automatic tools usually lead to less efficient designs. To support a more efficient co-design process for FPGA platforms, we propose a systematic methodology to map an application to a SW/HW platform with a custom HW accelerator and a microprocessor core. The mapping steps of the methodology are expressed through parametric templates for the SW/HW Communication Organization, the Foreground (FG) Memory Management, and the Data Path (DP) Mapping. Several performance-area trade-off design Pareto points are produced by instantiating the templates. A real-time bioimaging application is mapped onto an FPGA to evaluate the gains of our approach, i.e., 44.8% on performance compared with pure SW designs and 58% on area compared with pure HW designs.
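    Because instantiating the templates yields many candidate designs, the final step amounts to keeping only the non-dominated (performance, area) points. A small Pareto filter of our own illustrates that selection (the struct and metric names are assumptions):

```c
/* Illustrative Pareto filter over (latency, area) design points: a
 * design survives unless another is no worse in both metrics and
 * strictly better in one. O(n^2), fine for small candidate sets. */
typedef struct { double latency, area; } Design;

int pareto(const Design *d, int n, int *keep) {
    int m = 0;
    for (int i = 0; i < n; i++) {
        int dominated = 0;
        for (int j = 0; j < n && !dominated; j++)
            if (j != i &&
                d[j].latency <= d[i].latency &&
                d[j].area    <= d[i].area &&
                (d[j].latency < d[i].latency || d[j].area < d[i].area))
                dominated = 1;
        if (!dominated)
            keep[m++] = i;   /* index of a Pareto-optimal design */
    }
    return m;                /* number of Pareto points */
}
```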

    Illegal logging detection based on acoustic surveillance of forest

    In this article, we present a framework for the automatic detection of logging activity in forests using audio recordings. The framework was evaluated in terms of logging detection classification performance, and various widely used classification methods and algorithms were tested. Experimental setups using different sound-to-noise ratios were followed, and the best classification accuracy was reported by the support vector machine algorithm. In addition, a decision-level postprocessing scheme was applied that improved performance by more than 1%, mainly at low sound-to-noise ratios. Finally, we evaluated a late-stage fusion method combining the postprocessed recognition results of the three top-performing classifiers; the experimental results showed a further absolute improvement of approximately 2%, with logging-sound recognition accuracy reaching 94.42% at a sound-to-noise ratio of 20 dB.
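    The abstract does not spell out the fusion rule, but decision-level fusion of three classifiers is commonly a majority vote over per-frame decisions, optionally smoothed; the sketch below is our illustration of that generic scheme, not the paper's exact method:

```c
/* Illustrative decision-level (late) fusion and postprocessing for
 * per-frame logging decisions (1 = logging, 0 = background noise). */

/* Majority vote over three classifiers' decisions for one frame. */
int fuse3(int dec_a, int dec_b, int dec_c) {
    return (dec_a + dec_b + dec_c) >= 2;
}

/* Decision-level smoother: 3-frame majority vote that suppresses
 * isolated single-frame flips. */
void smooth(const int *in, int *out, int n) {
    for (int i = 0; i < n; i++) {
        int a = in[i > 0 ? i - 1 : 0];
        int c = in[i < n - 1 ? i + 1 : n - 1];
        out[i] = (a + in[i] + c) >= 2;
    }
}
```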

    Area-throughput trade-offs for SHA-1 and SHA-256 hash functions’ pipelined designs

    High-throughput designs of hash functions are strongly demanded due to the need for security in every transmitted packet of worldwide e-transactions. Thus, optimized and non-optimized pipelined architectures have been proposed, raising, however, important questions. What is the optimum number of pipeline stages? Is it worth developing optimized designs, or could the same results be achieved simply by increasing the number of pipeline stages of the non-optimized designs? The paper answers the above questions by studying extensively many pipelined architectures of the SHA-1 and SHA-256 hashes, implemented on FPGAs, in terms of the throughput/area (T/A) factor. Guidelines for developing efficient security-scheme designs are also provided.
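    The trade-off under study can be stated compactly (our formulation, not the paper's equations; r denotes the round count, 80 for SHA-1 and 64 for SHA-256):

```latex
% Illustrative model: an N-stage pipeline over r rounds accepts a new
% 512-bit block every r/N cycles at clock frequency f(N), so
\[
\mathrm{Throughput}(N) = \frac{512 \cdot f(N)}{r / N},
\qquad
\frac{T}{A}(N) = \frac{\mathrm{Throughput}(N)}{\mathrm{Area}(N)}.
\]
% Adding stages raises N (and usually f) but also the area, so the T/A
% factor peaks at some optimum stage count; locating that optimum for
% optimized and non-optimized designs is what the paper explores.
```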